2024-08-15
If you don’t already have it, go to https://cran.r-project.org/ and install the version of R appropriate for your operating system.
You can install RStudio from here: https://posit.co/download/rstudio-desktop/
Write commands in the source window (and save them when you’re done)
View output and results in the console
See what you’ve loaded or stored in the environment window
View and export plots from the output window (and manage packages from the packages tab)
Type the following in the source window:
Click your cursor on the top line and press CTRL + ENTER (CMD + Enter on a Mac). Your cursor should move down and the code will be evaluated. Take note of what happens and where things show up after executing each line.
[1] -0.240992979 0.290562989 -1.260498177 1.778242627 0.083370526
[6] 0.258915414 -1.825827060 0.132767917 1.523830743 -0.544133120
[11] -0.605523500 -0.585982563 -0.390377347 -0.142706939 0.553274671
[16] -0.934905301 -0.173927974 0.503253935 0.662751649 -0.534372481
[21] 0.148631025 -2.930025897 0.831300097 -2.355958782 0.416207299
[26] 0.691573492 -0.240323687 0.810977355 0.589290847 0.713681569
[31] 1.421196834 -0.377921153 -2.543555214 0.432781761 -0.917816342
[36] -0.872884122 0.839778670 0.805428976 -1.544391285 -0.558554294
[41] -0.213700538 -1.618569103 -0.819063679 -1.498673386 0.936998898
[46] -1.837256036 0.168963175 0.370003753 1.851666244 -1.158008673
[51] -0.263827759 -0.520597232 -0.683707671 -1.775524902 0.843431512
[56] -0.772280686 0.747986703 0.910515149 2.532081179 -0.306049335
[61] -1.233469504 0.152193449 -0.774905289 -0.068789225 -0.259711362
[66] -0.502788487 -0.926743516 -1.940149469 -0.141599425 -0.437935021
[71] 1.482654985 1.798253668 -0.922550750 0.301313185 -1.411908753
[76] 0.019610452 -0.046388730 0.932180054 -1.375700444 -0.874476382
[81] 1.460260837 0.270250046 -1.629444847 -0.409397148 0.795329840
[86] -0.397326760 -0.407821040 -1.023481050 -0.041318974 -1.899838769
[91] -1.213641906 -0.109380682 -0.223184960 1.499318949 1.022242792
[96] 0.106967086 -0.005171813 -0.784828636 0.605746452 -1.001586137
If I just reference the variable x, R will print its contents into the console.
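The code that produced this output is not shown above. A minimal sketch, assuming x was filled with 100 draws from a standard normal distribution using rnorm() (that is an assumption on my part; set.seed() is added only so the run is repeatable, and your numbers will differ from the ones printed above):

```r
# Assumption: the original exercise stored 100 random normal values in x.
set.seed(42)      # hypothetical seed, only for repeatability
x <- rnorm(100)   # create the values and store them under the name x
x                 # referencing x alone prints its contents to the console
length(x)         # number of stored values
```

After running the first two lines, x should also appear in the environment window.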
Now, with the same script open, press CTRL + SHIFT + S
What happened?
Now: save the current script with a name like apan_example.R. Then run:
Do the following:
Close RStudio
When asked if you would like to save the workspace image, click “No” (and ideally never click “Yes” ever from now until the end of time)
Re-open RStudio
Try to run just the last line of the code again:
What happened?
R packages extend the basic functionality of R so you can do more stuff more easily.
To use a package, you first need to:
Install it (just once!) using the install.packages("packagename") function or through the RStudio interface.
Then load it using the library(packagename) command each time you open R.
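As a rough illustration of the install-then-load pattern, the sketch below uses tools, a package that ships with R and so needs no installation. The choice of tools and of its file_ext() function is mine, purely for illustration; the install.packages() line is commented out and shown only for the pattern.

```r
# Step 1 (once per computer): install the package. Note the quotation marks.
# Not needed for 'tools', which ships with R.
# install.packages("packagename")

# Step 2 (every session): load the package. Note the lack of quotation marks.
library(tools)

# Functions from the package are now available by name.
ext <- file_ext("apan_example.R")
ext
```

If you skip the library() step, R will complain that it could not find the function.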
Install the Snake package by running (note the quotation marks):
(Or use the graphical interface)
Load the Snake package by running (note the lack of quotation marks)
Now you can use commands from the Snake package. Try running this:
Admittedly this is not a typical use-case for this tool…
You can use the Import Dataset menu to import data into R with point-and-click commands OR by writing code.
We’ll walk through both approaches using this data set as an example:
https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv
This file contains data from the article “A Statistical Analysis of the Work of Bob Ross” by Walt Hickey.
After loading the data with point-and-click commands, we should copy the code that produced it and save it in our script. That way, we can easily replicate or share our analysis by just re-running the script.
# load the readr package
library(readr)
# use the read_csv function to read the data
elements_by_episode <- read_csv(
"https://raw.githubusercontent.com/fivethirtyeight/data/master/bob-ross/elements-by-episode.csv"
)
# optionally, use View() to examine the data in the GUI
# View(elements_by_episode)
The library function loads the readr package, which is what allows us to use the read_csv function; read_csv is not a built-in command.
The View() function is handy, but it can be kind of annoying because it makes a pop-up, so I placed a # in front of this command to turn it into a comment.
Click the data set name in the environment window to view it. See if you can answer:
What is the unit of analysis?
How many observations are there (rows)? How many variables (columns)? How are things represented?
In the source window, start typing the name of the data set (elements_by_episode). What happens?
We’ve already used the <- operator to store a value and reference it by name. We call this process assignment.
Assignment is very flexible. You can change the value of an existing variable, you can make a copy with a new name, and you can even combine multiple variables into one or add them together.
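A small sketch of those cases; the variable names here are made up for illustration:

```r
a <- 5            # store a value
a <- a + 1        # change the value of an existing variable (now 6)
b <- a            # make a copy with a new name
total <- a + b    # add two variables together (12)
both <- c(a, b)   # combine multiple variables into one
total
both
```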
Variable names must start with either a letter or a period. They usually can’t contain spaces or certain special operators like + or -, but you can use these characters if you wrap the name in ` symbols:
A good practice is to give descriptive names to variables and use underscores (_) in place of spaces.
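Here is a quick sketch of these naming rules. All of the names and values are hypothetical and chosen only to show the syntax:

```r
tree_count <- 3         # descriptive name, underscores in place of spaces
.hidden_value <- 1      # a name may start with a period
`my odd-name` <- 10     # backticks allow spaces and operators in a name

`my odd-name` + tree_count   # the backticks are needed every time you use it
```

Names wrapped in backticks work, but they are tedious to type, which is one more reason to prefer underscores.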
In addition to storing data, R variables can have additional attributes and classes that impact how they’re stored, modified, or used in functions.
One of the most fundamental attributes is a variable’s mode, which is how R knows how to do things like distinguish numbers from text.
We’ve already seen numeric data. But two other very common ones are:
character, which is used for storing text and can be created by entering values inside quotation marks.
logical, which can take values of either TRUE or FALSE and can be created directly with those values OR by writing out a logical comparison like 3 > 4.
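To see the three modes side by side, here is a short sketch (the variable names and values are illustrative only):

```r
num_value  <- 3.14
text_value <- "happy little trees"   # quotation marks create character data
flag       <- TRUE                   # a logical created directly
comparison <- 3 > 4                  # a logical created by a comparison

mode(num_value)    # "numeric"
mode(text_value)   # "character"
mode(flag)         # "logical"
comparison         # FALSE
```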
What happens when you run this? And why?
Data structures allow us to store and perform calculations on entire sets of numbers or text. The ones we’ll see most often are vector, matrix, data frame and list.
Vectors store multiple elements of the same type. You can create a vector by passing a comma-separated list to the c() function.
The elements of a vector must share a type. If they don’t, then R will “coerce” each element to make them conform.
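A minimal sketch of both points, using made-up values:

```r
numbers <- c(1, 2, 3)        # every element is numeric
mixed   <- c(1, "a", TRUE)   # mixed types: R coerces everything to character

mode(numbers)   # "numeric"
mixed           # "1" "a" "TRUE"
mode(mixed)     # "character"
```

Notice that the number and the logical value in mixed were silently turned into text.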
We can use the [] operator to access specific elements of a vector. For instance, I can get the 2nd element of this vector by writing:
We can also use vectors to subset other vectors
And we can use a logical comparison to subset.
This comparison creates a logical vector with TRUE for each element where the expression is TRUE
(note the double equals sign == is meant to distinguish comparison from assignment.)
Elements of a vector can also be named. In those cases, you can also subset by referencing the name in quotation marks
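The subsetting styles described so far can be sketched on one small named vector (the names and values are hypothetical):

```r
v <- c(first = 10, second = 20, third = 30, fourth = 40)

v[2]          # the 2nd element, by position
v[c(1, 3)]    # a vector of positions picks several elements
v == 20       # a logical vector: TRUE only where the value equals 20
v[v > 15]     # keep the elements where the comparison is TRUE
v["third"]    # subset by name, in quotation marks
```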
Finally, you can also assign values to specific elements of an object using a subsetting operator with <-
[1] 1 2 3 4 5
[1] 1 2 3 10 10
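The code behind this output is not shown above. One way to get the same two printed results, assuming the vector started as 1:5 and the last two elements were overwritten with 10 (both assumptions, as is the name nums):

```r
nums <- 1:5
nums              # 1 2 3 4 5
nums[4:5] <- 10   # assign 10 to the 4th and 5th elements
nums              # 1 2 3 10 10
```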
A matrix has a single type of data arranged in a fixed number of rows and columns. Here’s a matrix with 3 columns and 5 rows.
“Under the hood” a matrix is really just a vector with some extra attributes, so I can subset it just like I would a vector:
You can distinguish a matrix from a vector by checking its class
It usually makes more sense to take an entire row or column from a matrix. To do that, I can use syntax like this:
Matrices can have both row and column names, and these can also be used for subsetting
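The matrix examples are not reproduced above, so here is a minimal sketch that walks through the same ideas. The values 1:15 and the row and column names are my own choices for illustration:

```r
# A matrix with 5 rows and 3 columns, filled column by column from 1:15
m <- matrix(1:15, nrow = 5, ncol = 3)

m[7]       # vector-style subsetting: the 7th element (row 2, column 2)
class(m)   # includes "matrix" (R 4.0 and later also report "array")
m[2, ]     # the entire 2nd row
m[, 3]     # the entire 3rd column

# Row and column names can also be used for subsetting
rownames(m) <- c("a", "b", "c", "d", "e")
colnames(m) <- c("x", "y", "z")
m["b", "z"]
```

Leaving one side of the comma empty means "everything along this dimension."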
Data frames have rows and columns like a matrix, but:
You can access columns with the data$colname notation (but the matrix subsetting operators also work!).
Usually things will just “become” data frames when you import them, but you can also make one yourself with the data.frame() function.
Use the $ operator to access entire columns (but not rows!)
Or you can use double brackets followed by a column name in quotation marks
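A small sketch of making a data frame by hand and pulling columns out of it. The data here are invented for illustration and are not taken from the Bob Ross file:

```r
# Hypothetical toy data, only for illustration
paintings <- data.frame(
  title = c("Painting A", "Painting B", "Painting C"),
  trees = c(1, 1, 0),
  guest = c(0, 0, 1)
)

paintings$title        # an entire column with $
paintings[["trees"]]   # double brackets with the column name in quotation marks
paintings[2, ]         # matrix-style subsetting also works: the 2nd row
paintings[, "guest"]   # ...or a single column by name
```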
You’re likely to encounter data frames more than any other type of data. However, many statistical operations will coerce your data to a vector or matrix before actually conducting the analysis.
A tibble is, for all intents and purposes, the same thing as a data frame, but it has fewer weird behaviors.
Detailed documentation for data frame subsetting and assignment can be found in R’s help files, for example by running help("[.data.frame").
Lists are like data frames without any of the restrictions. They can contain any number of types and can even contain other lists or data frames:
mylist = list(
"letters" = c("A", "B", "C"),
"scalar" = 10,
"nested_list" = list(
"palette_1" = list("red", "blue", "green", "white"),
"palette_2" = list("pink", "brown", "black")
)
)
mylist
$letters
[1] "A" "B" "C"
$scalar
[1] 10
$nested_list
$nested_list$palette_1
$nested_list$palette_1[[1]]
[1] "red"
$nested_list$palette_1[[2]]
[1] "blue"
$nested_list$palette_1[[3]]
[1] "green"
$nested_list$palette_1[[4]]
[1] "white"
$nested_list$palette_2
$nested_list$palette_2[[1]]
[1] "pink"
$nested_list$palette_2[[2]]
[1] "brown"
$nested_list$palette_2[[3]]
[1] "black"
List subsetting works more-or-less like subsetting a data frame, except that lists don’t have rows or columns, so the matrix-style notation won’t work.
And you can also access parts of a nested list by combining subset operators:
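Using the mylist object defined earlier (repeated here so the snippet runs on its own), list subsetting might look like this sketch:

```r
mylist <- list(
  "letters" = c("A", "B", "C"),
  "scalar" = 10,
  "nested_list" = list(
    "palette_1" = list("red", "blue", "green", "white"),
    "palette_2" = list("pink", "brown", "black")
  )
)

mylist$letters                      # a named element with $
mylist[["scalar"]]                  # the same idea with double brackets
mylist[1]                           # single brackets return a smaller list
mylist$nested_list$palette_1[[2]]   # combine operators to reach a nested value
mylist[["nested_list"]][["palette_2"]][[3]]
```

The last two lines return "blue" and "black" respectively.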
Use the <- operator to create a named object in R that can be re-used elsewhere in a script.
R objects have a mode that reflects the kinds of data they store. Common modes are numeric, logical and character. (A factor is a class rather than a mode: under the hood it is stored as numeric codes with labels.)
Data can be grouped together in structures like vector, matrix, data.frame and list, each with different restrictions and behaviors.
You can use [, [[ and $ operators to extract specific elements of an R object.
| Function | Description |
|---|---|
| str(), summary() | Both of these functions give some summary information about a complex object. str() tells you more useful information about the structure, whereas summary() is more likely to give statistical summaries like the mean and median. |
| mode() | Gives information about the kind of data (numeric, character, etc.) that an object can hold. |
| class() | Tells you the class of an R object. (Classing is what allows a function like summary() to behave differently depending on what kind of object you give it.) |
| View() | Brings up a spreadsheet-style view of an R object (you can also execute this function in RStudio by clicking the object name in the environment pane). |
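A brief sketch of these inspection functions on a small numeric vector (the values are hypothetical):

```r
scores <- c(2, 4, 4, 10)   # hypothetical values

str(scores)       # compact description of the structure
summary(scores)   # minimum, quartiles, median, mean and maximum
mode(scores)      # "numeric"
class(scores)     # "numeric"
# View(scores)    # opens the spreadsheet-style viewer (interactive sessions only)
```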
| Function | Description |
|---|---|
| subset() | Returns rows of a data frame that match a logical expression. (This does the same stuff that [] does but can be more readable.) |
| which() | Returns the indices of elements that match a logical statement. |
| [], [[]], $ | Access an object by index or name. $ only works with lists and data frames and can only return one named element (i.e. a single column of a data frame). |
| : | Returns a sequence of numbers from i to j (e.g. 1:3 returns 1, 2, 3). You can use this with [ to do things like retrieve the first 3 columns of a matrix or data frame. |
| names(), colnames(), rownames(), dimnames() | Get or set names on an object or dimension. |
| ncol(), nrow(), length() | Get the number of columns, rows, or the total length of an object. |
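To make the table a little more concrete, here is a sketch that tries several of these functions on a toy data frame. The data and the name df are invented for illustration:

```r
# Hypothetical toy data, only for illustration
df <- data.frame(
  id    = 1:4,
  trees = c(1, 0, 1, 1),
  guest = c(0, 0, 1, 0)
)

subset(df, trees == 1)   # rows that match a logical expression
which(df$guest == 1)     # index of the matching element: 3
df[, 1:2]                # ':' builds a sequence, here the first 2 columns
names(df)                # column names
nrow(df)                 # number of rows: 4
ncol(df)                 # number of columns: 3
length(df$trees)         # length of a single column: 4
```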
Using the Bob Ross data frame from earlier:
How many episodes were there where Ross painted trees? How could you make a vector with the titles of these episodes?
The “GUEST” column contains a 1 if an episode had a guest host and 0 otherwise. How would you remove all of the guest-hosted episodes?
How would you remove the title and episode number from the data frame?
How would you find all of the elements he painted in episode 1?
Advanced users:
find all the episodes where Ross only painted a single tree.
make a list with only the non-zero elements from each episode.
Functions are basically just blocks of generalized code that can be used over and over again. We’ve already used several:
c() concatenates values into a vector
matrix() turns a vector into a matrix
data.frame() turns a series of equal-length vectors into a data frame
mode() tells you the type of a particular kind of data
We’ve also used operators like + and -. In actuality, these are also a kind of R function.
“Infix” functions are so obvious you probably don’t even think of them as functions. Try typing this into the script editor and then send it to R:
Infix operators will always take a left hand side (LHS) argument and a right hand side argument (RHS)
LHS + RHS
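The snippet for this step is not shown above. One way to convince yourself that an operator really is a function is to call it like one, which R allows if you wrap the operator in backticks. This is a sketch of that idea and not necessarily the original example:

```r
1 + 2       # the usual infix form: left hand side, operator, right hand side
`+`(1, 2)   # the same operator called like any other function

identical(1 + 2, `+`(1, 2))   # TRUE
```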
Arithmetic
| Operator | Usage | Example |
|---|---|---|
| +, -, /, * | plus, minus, divide, and multiply, respectively | 1 + 12 / 3 |
| %%, %/% | modulo division and integer division | |
| ^ | raise the left hand side to the power of the right hand side | 3 ^ 4 |
| == | test for equality | |
| != | test for inequality | "Horse" != "Donkey" |
| >, <, >=, <= | greater than, less than, greater than or equal to, less than or equal to | |
| &, \| | logical AND and OR (multiple logical conditions) | 10 > 5 & 11 < 12 |
| %in% | check if values of LHS are present in RHS | 3 %in% c(1, 3, 10) |
| <-, = | assign RHS to LHS (<- is preferred over the equals sign because it’s less likely to be confused for a comparison) | x <- 13 |
| : | make a sequence of numbers from LHS to RHS | 1:10 |
| ~ | used in model formulas (like when we want to estimate the effect of a variable on an outcome) | lm(speed ~ dist, data = cars) |
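A few of the rows above have no example, so here is a short sketch you can run to see those operators in action. The numbers are arbitrary:

```r
7 %% 3             # modulo division (the remainder): 1
7 %/% 3            # integer division: 2
2 + 2 == 4         # test for equality: TRUE
5 >= 5             # greater than or equal to: TRUE
3 < 2              # less than: FALSE
10 > 5 | 11 > 12   # logical OR: TRUE, because the first condition holds
```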
Most R functions will use prefix notation. Use a prefix function by first writing the name of the function, followed by parentheses with some arguments inside.
The help function is a special function that brings up information about functions. Run this and see what shows up in the bottom right window of RStudio.
Let’s take a closer look at the Default S3 method here for the mean() function
x is the main input.
The part about Default S3 method: is not especially important here, but some functions will have different methods depending on the class of data (check out the help file for print for instance)
The two remaining arguments (trim and na.rm) have a name AND a default value. So if we don’t specify one of them, R will assume the default is what we want.
The ellipsis (...) allows additional arguments to be included.
Note
R is pedantic about math! There really isn’t a valid way to calculate the mean with NA values, and NA is supposed to be a placeholder, so R doesn’t want to ignore it by default. So you’re forced to be explicit.
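A minimal sketch of what this looks like in practice, using made-up values:

```r
values <- c(1, 2, NA)

mean(values)                 # NA: R refuses to ignore the missing value
mean(values, na.rm = TRUE)   # 1.5: the NA is dropped because we asked explicitly
```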
One more thing to note here is that we can name each argument explicitly, but we don’t have to:
Now try without specifying the name x:
If we don’t name the arguments, R will assume they’re being given in the order specified in the help file. This can cause problems if we get things out of sequence. Which is why this is okay:
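The examples for this part are not shown above, so here is a sketch of the same idea with mean(). The vector v and the trim value are my own choices:

```r
v <- c(1, 2, 3, 100)

mean(x = v, trim = 0.25)   # every argument named
mean(v, trim = 0.25)       # first argument matched by position
mean(v, 0.25)              # both matched by position: x first, then trim
mean(trim = 0.25, x = v)   # named arguments can be given in any order
```

All four calls return 2.5, because trimming a quarter of the values from each end leaves only 2 and 3. Swapping the order without names, as in mean(0.25, v), would instead treat 0.25 as the data.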
You can create your own functions. To create a function, you’ll use the following general syntax:
Now execute your function just like you would any other R function.
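The syntax example is not reproduced above. As a rough sketch, a function definition has a name, a list of arguments (optionally with defaults), and a body in curly braces. The name add_two and its arguments are hypothetical:

```r
# General shape: name <- function(arguments) { body }
# 'add_two' is a hypothetical name chosen for this example.
add_two <- function(x, amount = 2) {
  result <- x + amount   # do something with the arguments
  result                 # the last evaluated value is what gets returned
}

add_two(5)                # uses the default amount: 7
add_two(5, amount = 10)   # overrides the default: 15
```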
Note: Many built-in functions like sum() are available as soon as you load R, and package functions are imported when you load the package. User-created functions, however, are only loaded into the global environment, so they go away when you close R.
A more practical example: R doesn’t have any built-in function for calculating the standard error of a continuous variable. But we can make one:
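One possible version, assuming "standard error" here means the standard error of the mean (the standard deviation divided by the square root of the number of observations). The name std_error and the na.rm argument are my own additions:

```r
# Standard error of the mean: sd(x) / sqrt(n).
# 'std_error' is a hypothetical name; the na.rm handling is an added assumption.
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # drop missing values only when explicitly asked to
  }
  sd(x) / sqrt(length(x))
}

std_error(c(2, 4, 4, 10))
std_error(c(2, 4, 4, 10, NA), na.rm = TRUE)
```

Both calls give the same answer because the second one drops the NA before counting the observations.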
Use help() to get information about an R function.
Find the top 10 most common elements Bob Ross painted and create a bar plot showing the frequency of each.
Start by writing an outline with all the steps you need to complete:
# Step 0. import the data (you should already have the code for this) ----
# Step 1. remove the columns with episode numbers or titles ----
# Step 2. get the sum of the numeric columns (there's a way to do this with just one function!) ----
# Step 3. sort the sums from highest to lowest ----
# Step 4. view the top 10 elements from this sorted result ----
# Step 5. create a bar plot from the object created in step 4 ----
# if you're already comfortable with the above: ----
# - Get a count of the number of each element used per season
# - find exact duplicates (episodes with the exact same elements)
# - find correlations between elements (and maybe plot them)
Error messages are informative! Even if they seem inscrutable.
When something doesn’t work, stop, take a breath, and then read the output.
Consult the help files
Search the internet! Few problems are unique and R has a lot of users.
It’s pointless to try to memorize everything. Instead, try to follow good script-writing practices:
Give objects meaningful names
Write lots of comments
Save your scripts (also with meaningful names)
Use the GUI to figure stuff out, but write the equivalent commands in your script
Don’t try to tackle complex problems in one fell swoop. Take a couple of observations, figure out the correct answer, and then try to write code that gets you there. Then generalize that case to the entire data set.
Talk to yourself: use the comments to outline what you will do before you do it.
Ask for help from your classmates or instructors!
You’ll get better help if you:
share enough code and data so that someone could replicate your problem, or at least a small version of it
let them know what you’ve already tried
share the error messages that show up in the console, or - if there aren’t any errors - explain what’s wrong with your results